Abstract

High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecu- 
lar biological research. However, the immense amount of sequence data requires computational skills and 
suitable hardware resources that are a challenge to molecular biologists. The DNA Data Bank of Japan 
(DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, 
the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for a high-throughput annotation of NGS reads. The 
DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets using decen- 
tralized processing by NIG supercomputers currently free of charge. The proposed pipeline consists of two 
analysis components: basic analysis for reference genome mapping and de novo assembly and subsequent 
high-level analysis of structural and functional annotations. Users may smoothly switch between the two 
components in the pipeline, facilitating web-based operations on a supercomputer for high-throughput 
data analysis. Moreover, public NGS reads of the DDBJ Sequence Read Archive located on the same supercom- 
puter can be imported into the pipeline through the input of only an accession number. This proposed pipe- 
line will facilitate research by utilizing unified analytical workflows applied to the NGS data. The DDBJ 
Pipeline is accessible at http://p.ddbj.nig.ac.jp/. 

Key words: next-generation sequencing; sequence read archive; cloud computing; analytical pipeline; 
genome analysis 



1. Introduction

Next-generation sequencing (NGS) is an increasingly 
important technology in genome and molecular 
biology research, partly because of its rapidity, precision,
and cost effectiveness. 1-4 NGS technology allows
several analyses such as resequencing, de novo assembly 
of genomes, transcriptome analysis, Chromatin 
Immunoprecipitation (ChIP) sequencing, and exome 
analysis. 5 With ever-decreasing sequencing costs, NGS
read datasets can now reach terabase sizes. These
massive sequencing datasets demand high-perform- 
ance computational resources, rapid data transfer, 
large-scale data storage, and competent data analysts. 
This increase in scale appears to impede data mining 
and analysis by researchers. 

The DDBJ Sequence Read Archive (DRA), released in 
2009, is a data archive for NGS raw reads that has 
been maintained at the DNA Data Bank of Japan 
(DDBJ) of the National Institute of Genetics (NIG). 6,7 
The DRA is a global provider of public nucleotide
sequences in partnership with the International 
Nucleotide Sequence Database Collaboration 
(INSDC) 8 consisting of the Sequence Read Archive 
(SRA) of the National Center for Biotechnology 
Information (NCBI) in the USA 9 and the European 
Read Archive (ERA) of the European Bioinformatics 
Institute (EBI) in Europe. 10 Researchers may wish to 
reuse massive read datasets in DRA; however, their 
DRA file size tends to be too large to be downloaded 
to a local computer. 

A computational system known as the 'cloud', consisting
of data services provided via the Internet, was recently
developed. Cloud computing allows users to avail themselves
of services provided by data centres without building their
own infrastructure. The infrastructure of the data centre
is shared by a large number of users, reducing the cost to
each user. To manage the flood of NGS data, several
large-scale computing platforms have been recommended. 11-13
Cluster computing is performed by multiple computers
typically linked through a fast local area network and
functioning effectively as a single computer. Grid computing
is performed by loosely coupled networked computers from
different administrative centres that work together on
common computing tasks. Cloud computing is the computing ability that
abstracts away the underlying hardware architecture 
and enables convenient on-demand network access 
to a shared pool of computing resources that can be 
readily provisioned and released. In particular, a 
model of cloud computing, Software as a service 
(SaaS), is referred to as 'on-demand software' and is 
available via a web browser. Cloud computing is a 
system uniting clusters of personal computers linked 
together similar to grid computing. The hallmark of 
cloud computing is that the users can perform compu- 
tation across the Internet, without the necessity of 
understanding the underlying architecture. 

The DDBJ Read Annotation Pipeline (DDBJ Pipeline),
a cloud computing-based analysis pipeline for DRA NGS
data, was released in 2009 with the aim of supporting
users wishing to submit NGS data analysis results to
the DDBJ database. This pipeline comprises
two analytical components: a basic analytical process 
of reference mapping and de novo assembly and a 
process of multiple high-level analytical workflows. 
The main workflows of the high-level analysis offer 
structural and functional annotations. The DDBJ 
Pipeline, which is a web application based on the SaaS 
model of cloud computing, assists in the submission 
of analysed results to DDBJ databases by automatically 
formatting data files and facilitates the web-based 
operation of NIG supercomputers for high-throughput 
data analysis. Although conventional web-based 
genome-analysis pipelines, such as NCBI Prokaryotic 
Genomes Automatic Annotation Pipeline (PGAAP) 14
and Rice Genome Automated Annotation System
(RiceGAAS), 15 perform genomic annotation of a draft
sequence, their main target is Sanger-based sequence 
reads in small datasets. In contrast, the DDBJ Pipeline 
processes multiple datasets of terabase size using 
the computational resources of NIG supercomputers 
(the system is introduced in
http://sc.ddbj.nig.ac.jp/index.php/en/).

In this report, we introduce the DDBJ Pipeline system 
with respect to its hardware and software configuration 
and outline its usage statistics since 2009. At present, 
the NIG supercomputer time is provided free of 
charge. We believe that provision of computational ser- 
vices for NGS data analysis without cost to users will in- 
crease the use of public data and accelerate data 
submission to public databases. 

2. Materials and methods 

2.1. Basic analysis

The pipeline accepts single- or paired-end reads in 
FASTQ 16 format and simple metadata describing the 
organism and experimental conditions associated 
with the reads. The type of sequencer is immaterial,
provided that the data format of the reads is followed.
Users submit their NGS data and XML-formatted metadata
to the DRA. They may also submit the NGS data to a
DDBJ Pipeline directory. Users subsequently analyse the
NGS data in the pipeline using the accession numbers.
The FASTQ-formatted sequences and metadata are
loaded from the DRA databases. The DDBJ Pipeline
allows pre-processing by trimming low-quality bases 
from both ends of the reads. The pre-processing func- 
tion returns statistics and figures describing read qual- 
ities by read position, and these outputs enable users 
to set trimming parameters. The FASTQ files are used 
either for genome mapping or for de novo assembly. 
The basic analysis supports various mapping and 
de novo assembly tools for the NGS data according to 
the user's preference (analytical programme tools
hosted in the DDBJ Pipeline 17-36 are listed in Table 1).
Optional analytical parameters can be selected. 
Sequential commands from pre-processing to output- 
ting analytical results are preset for easy operation. 
Reference data, such as the relevant genome sequence,
can be retrieved from DDBJ databases by the Simple 
Object Access Protocol (SOAP). 37 Users can confirm 
the error rate by read position and can trim low- 
quality bases from the reads. The numbers of mapped 
reads, genome coverage, depth, and maximum contig 
length are reported. Output files from all processing 
stages including SAM-formatted files, 24 if supported 
by the tool, can be downloaded from an FTP server. A 
multiple FASTA file, which is convenient for subsequent 
submissions to the whole-genome shotgun (WGS) 
section of DDBJ, is built on the basis of consensus
sequences from mapping or contig files from de novo 
assembly. The basic analysis system is built mainly 
using Perl 5.8, Java 6, PostgreSQL 8.3, and gnuplot 
(http://www.gnuplot.info/). Mapping and de novo assembly
are performed on NIG supercomputers using
704 8-core 2.60-GHz Intel Sandy Bridge CPUs with
64 GB RAM and 1.6 TB storage, and 96 8-core 2.66-GHz
Intel Xeon CPUs with 10 TB RAM, respectively.
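
The pre-processing step described above (trimming low-quality bases from both ends of a read and summarising quality by read position) can be illustrated with a minimal sketch. It is not the pipeline's own implementation: the Phred+33 quality encoding, the threshold of 20, and the function names are assumptions made for illustration only.

```python
def phred33_scores(quality_line):
    """Decode a FASTQ quality string into integer scores.

    Assumes the Phred+33 encoding; other encodings need a different offset.
    """
    return [ord(ch) - 33 for ch in quality_line]


def trim_both_ends(sequence, quality_line, min_quality=20):
    """Trim bases whose score is below min_quality from both read ends.

    Returns the trimmed (sequence, quality) pair. Both strings are empty
    when no base reaches the threshold. min_quality=20 is a hypothetical
    default, not a value taken from the pipeline.
    """
    scores = phred33_scores(quality_line)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality_line[start:end]


def mean_quality_by_position(quality_lines):
    """Average the quality score at each read position across all reads."""
    totals, counts = [], []
    for line in quality_lines:
        for i, score in enumerate(phred33_scores(line)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

A per-position mean of this kind is the sort of statistic a user could inspect before choosing a trimming threshold.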

Mapping benchmarks in the basic analysis were calculated
using the whole-genome NGS data from the
Japanese rice cultivar 'Omachi' (paired-end reads of
accession number DRR000719) and using the complete
genome sequences of the Japanese rice cultivar
'Nipponbare' as reference (accession numbers
NC_008394-NC_008405). De novo assembly was
performed using whole-genome NGS reads of
Escherichia coli NDM1Dok01 (paired-end reads of
accession number DRR001003).

2.2. High-level analysis 

Because advanced analysis after mapping and de novo 
assembly requires several workflows with variable func- 
tions, the high-level analysis system was mainly designed 
to use the Galaxy interface, 30 a genomic workbench with 
a graphical user interface. To date, single-nucleotide 
polymorphism (SNP) analysis, transcriptome analysis 
(RNA-seq), and ChIP-sequencing have been implemented
using SAMtools, 24 Cufflinks, and MACS, 33 respectively
(Table 1). These analyses are performed using
mapped results in the SAM format generated by the 
basic analysis. Users can modify parameter settings 
through the graphical user interface and execute the 
analysis flows repeatedly using Galaxy's Workflow and 
History methods. For SNP analysis, a figure showing the
frequency distribution of SNPs over the entire genome
can be produced. For transcriptome analysis, mapped 
results are sent to Cufflinks to quantify gene structures 
and expression values. In addition, these results can be 
visualized by genomic regions linked to the UCSC 
genome browser site (http://genome.ucsc.edu/). 
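
The frequency distribution of SNPs over a genome mentioned above amounts to counting SNP positions in consecutive windows. The following is a minimal sketch of that idea, not the code used by the pipeline; the function name, the 1-based coordinates, and the fixed window size are assumptions.

```python
def snp_counts_per_window(positions, sequence_length, window_size):
    """Count SNPs in consecutive fixed-size windows along one sequence.

    positions are assumed to be 1-based coordinates on a sequence of
    sequence_length bases. The last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    n_windows = (sequence_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in positions:
        if not 1 <= pos <= sequence_length:
            raise ValueError(f"position {pos} is outside the sequence")
        counts[(pos - 1) // window_size] += 1
    return counts
```

The resulting list of counts can be passed to any plotting tool to draw a histogram along the chromosome.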

The high-level analysis system requires the Galaxy
environment, Cairo (http://www.cairographics.org/), and
Perl modules from CPAN (http://www.cpan.org/) for
graphical output. The analysis is performed on the
same nodes as the mapping, using the 704 8-core
2.60-GHz Intel Sandy Bridge CPUs with 64 GB RAM
and 1.6 TB storage.

3. Results and discussion 

3.1. System configuration of the proposed pipeline

The system outline of the DDBJ Pipeline is summarized
in Supplementary Fig. S1. Apache Tomcat and DB
servers for the DDBJ Pipeline run on an NIG supercomputer,
whereas the FTP server that handles data import and
export resides outside the supercomputer. Reads
imported from the DRA are sent via its built-in FTP
server.

The pipeline is built as a cloud computing-based web 
application, and its flow follows two steps. The basic 
analysis receives transferred reads and maps them to 
reference genomes or assembles them. The high-level 
analysis generates results closer to the research goals, 
such as genome contig construction, SNP detection, or 
expression analysis. 

NGS data are transferred either to an analysis server 
for basic analysis or to Galaxy interface servers for 
high-level analysis, both residing within an NIG super- 
computer. Classified on the basis of purpose, the data 
are analysed by the supercomputer nodes using the 
qsub command of the Univa Grid Engine.
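
As a rough illustration of how a web application might hand a job to the grid engine, the sketch below only builds a qsub argument list; it does not submit anything. The options shown (-N for the job name, -cwd to run in the current directory, -b y to treat the command as an executable rather than a job script) are standard Grid Engine options, but the text does not state which options the pipeline actually uses. The job name and the script run_mapping.sh are hypothetical.

```python
import shlex


def build_qsub_command(job_name, command, use_cwd=True):
    """Build an argument list for submitting a command with qsub.

    job_name and command are supplied by the caller. Nothing is executed
    here, so the sketch can be inspected without a grid engine installed.
    """
    argv = ["qsub", "-N", job_name]
    if use_cwd:
        argv.append("-cwd")   # run the job in the submission directory
    argv += ["-b", "y"]        # treat the command as an executable
    argv += list(command)
    return argv


def render(argv):
    """Render an argument list as a shell-safe command line for logging."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical mapping job for one accession number.
example = build_qsub_command("map_DRR000719", ["./run_mapping.sh", "DRR000719"])
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to a process-launching call.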

3.2. A pipeline for high-throughput analysis of NGS data

In the basic analysis, the DDBJ Pipeline provides the
following useful functions: (i) Data transfer: at the start
of analysis, users can specify three methods for query 
data: FTP uploading, secure copy from DRA if the data 
have been pre-registered to DRA, or HTTP uploading. 
If users wish to use public data as query data, they 
may choose directory upload from the DRA, whose 
data are shared with SRA and ERA. Public data may be 
used not only as query data, but also as reference 
sequences for mapping. (ii) Pre-processing in the form
of trimming off low-quality parts of sequence reads:
basecalling quality is not uniform and may influence
mapping or assembly quality. Although trimming off
less accurately identified bases is effective in maintaining
quality, several analysis tools can be used as optional
functions. 38-41 The DDBJ Pipeline outputs trimmed
reads and figures showing the distributions of read
quality scores. (iii) Parameter changes to software
components: the DDBJ Pipeline allows modification of some
options and parameters of the software and allows
users to limit reads to uniquely mapped reads in the
SAM file by removing multiread sets (Fig. 1). (iv)
Confirming job status and ensuring confidentiality:
the DDBJ Pipeline communicates with web applications
to analyse the NGS data using DDBJ supercomputers
and currently supports 11 mapping or de novo assembly
software packages (Table 1). During pre-processing,
mapping, or de novo assembly on the supercomputer, 
users can confirm the status of their operation 
through a web browser (Fig. 2). The user's jobs are 
listed along with their status ('running', 'complete', 
'error', etc.) and elapsed times. When the jobs are com- 
pleted, the DDBJ Pipeline notifies users by e-mail. 
Output is not limited to SAM-formatted files as
mapping results or FASTA files as assembly results, but
includes intermediate files, work logs, and statistical 
data, including mapping coverage and depth or N50 
contig size of assemblies. The DDBJ Pipeline and the 
NIG supercomputer system, which execute the pipeline 
jobs, may be accessed by an unspecified number of 
users. To protect user confidentiality, the DDBJ Pipeline 
does not allow users to identify any other user except 
for the demo user, and users may never access each 
other's queries except for public data and results. 
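
The summary statistics named above (mapping coverage, depth, and N50 contig size) have simple definitions that can be sketched in a few lines. This is an illustrative sketch only: it assumes a per-base depth list is already available, uses one common definition of each statistic, and is not the pipeline's own script.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths.

    N50 is taken here as the length of the contig at which the running
    total, summed from the longest contig downwards, first reaches at
    least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def coverage_and_mean_depth(per_base_depth):
    """Return (coverage fraction, mean depth) for one reference sequence.

    per_base_depth holds the number of reads covering each reference
    position. Coverage is the fraction of positions with depth > 0 and
    mean depth is averaged over all positions, covered or not.
    """
    n_positions = len(per_base_depth)
    if n_positions == 0:
        raise ValueError("the reference must have at least one position")
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / n_positions, sum(per_base_depth) / n_positions
```

Some tools average depth over covered positions only, so reported values should be compared with the definition in mind.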

Figure 1. Interface for modifying the settings of analysis tools in basic analysis of the DDBJ Pipeline.

Figure 2. Job status list in basic analysis of the DDBJ Pipeline. Jobs executed in the DDBJ Pipeline are shown in lists, and users may manage the
jobs, for example, by downloading results or by halting the jobs. The bars at the right end of the list indicate elapsed times.

As an example of a benchmark of the system, the
DDBJ Pipeline enables the mapping of 34.7 million
75-base NGS reads to a 383-Mb reference genome
using the BWA program 18,19 in 6.5 h and can assemble
24.4 million 80-base paired-end WGS reads in
10.5 min with SOAPdenovo, 27 using the cloud computing
system. NGS technologies reduce the cost and time
required for sequencing, and the resulting data increase
the submissions to public archives. Current developments
in NGS technologies are leading to increases in
both read length and the number of reads, and novel
biological strategies are being developed to utilize the
sequencing systems. The expansion of read numbers
requires expanded computational resources, particularly
for de novo assembly. 42 The use of a computational
cluster system allows decentralized processing, resulting
in scalability for efficient analysis of NGS data. The
DDBJ Pipeline supports not only the use of public
domain data, but also the submission of mapping and
assembly results to the WGS division of DDBJ in FASTA
format. Basic analysis of the DDBJ Pipeline is accessible
at http://p.ddbj.nig.ac.jp.
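
The benchmark figures quoted above can be turned into a nominal sequencing depth and a throughput with simple arithmetic. The sketch below assumes that every counted read has the stated length and that all bases contribute to depth, which real data with trimmed or unmapped reads will not satisfy exactly; the function names are illustrative.

```python
def nominal_depth(n_reads, read_length, genome_size):
    """Total sequenced bases divided by the reference genome size."""
    return n_reads * read_length / genome_size


def reads_per_second(n_reads, elapsed_seconds):
    """Average number of reads processed per second of wall-clock time."""
    return n_reads / elapsed_seconds


# Mapping benchmark: 34.7 million 75-base reads, 383-Mb reference, 6.5 h.
mapping_depth = nominal_depth(34_700_000, 75, 383_000_000)
mapping_rate = reads_per_second(34_700_000, 6.5 * 3600)

# Assembly benchmark: 24.4 million 80-base reads in 10.5 min.
assembly_rate = reads_per_second(24_400_000, 10.5 * 60)
```

Under these assumptions the mapping input corresponds to roughly 6.8-fold nominal depth of the reference genome.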



3.3. Genome-wide annotation by the high-level analysis 
of NGS data 

The DDBJ Pipeline supports not only basic analyses, 
such as mapping, but also high-level analyses via the 
Galaxy interface, which has the advantage of modifia- 
bility and easy maintenance. 30 User data such as login 
accounts, e-mail addresses, and passwords for access 
to Galaxy are shared with those of basic analysis. 
Therefore, basic analysis results, which are mapping or 
assembly jobs identified by job IDs, can be imported 
into Galaxy. 

The high-level analysis has recently been augmented 
with the following four analysis methods: 

(i) SNP detection: Users may view the pileup data produced
by the basic analysis and identify SNPs using
ANNOVAR. 31 They may also view figures showing
SNP distribution on the genome (Supplementary
Fig. S2A and S2B).

(ii) RNA-Seq analysis: Expression analysis using 
TopHat 23 produces a SAM file that is subsequently 
processed using Cufflinks. 32 Downloaded 
Cufflinks results are sent to the UCSC genome
browser site (http://genome.ucsc.edu/cgi-bin/hgGateway), 43
allowing the user to identify read
expression patterns within genome-wide images. 
Cuffcompare, an analysis function of Cufflinks, is 
also available. 

(iii) ChIP-Seq analysis: The DDBJ Pipeline supports
MACS 33 for ChIP-Seq analysis.

(iv) Annotation of contigs by de novo assembly: A length
filter is applied to remove short fragments. For
gene finding in contigs, gene prediction tools,
such as GENSCAN for eukaryote and
GeneMark.hmm for prokaryote data, 34,35 are
applied. BLASTX is used for similarity searching 
against known proteins 36 (Supplementary Fig. 
S2C). Supplementary Figs S2A, S2B, and S2C 
showing the whole-genome distribution of SNPs 
or the functional annotation of assembled 
contigs provide researchers with inspiration for 
new discoveries. In addition, new strategies for 
applying NGS technologies to novel biological 
analyses will be developed in the future. 
Therefore, the high-level analysis has the flexibility 
to be modified. 
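
The length filter applied to contigs in item (iv) can be sketched as follows. The text does not give the cut-off used by the pipeline, so min_length is a hypothetical parameter, and the small FASTA parser is a simplification written for this example.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input: three contigs, one of them shorter than the cut-off.
toy_fasta = [">c1", "ACGT", "ACGT", ">c2", "AC", ">c3", "ACGTA"]
kept = filter_by_length(parse_fasta(toy_fasta), min_length=5)
```

Filtering before gene prediction and similarity searching avoids spending compute time on fragments too short to contain a gene.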

The high-level analysis of the DDBJ Pipeline can be
accessed at http://p-galaxy.ddbj.nig.ac.jp.

3.4. Usage statistics of the DDBJ Pipeline 

As a beta version offering only the basic analysis
component, the DDBJ Pipeline has been open to the public
via the Internet with updates since August 2009.
Some analytical tools have been replaced according to
the frequency of their use since then. From the start of
recording in June 2010, the number of jobs submitted
was around 1800 (Table 2), not considering those used
for development and demonstration. The number of
mapping jobs (1428) was more than four times that of de
novo assembly jobs (326).

Table 2. Job numbers for the basic analysis of the DDBJ Pipeline
(since June 2010)

Year                      Pre-processing  Mapping  de novo assembly  Total
2010                      a               674      35                709
2011                      11              310      152               473
2012 (January-September)  33              444      139               616
Total                     44              1428     326               1798

a Pre-processing was still under construction.

3.5. Building an environment supporting the use of
NGS data by biologists

Deposits of NGS data in public databases (DRA, ERA,
and SRA) are rapidly increasing each year. 6 However,
the NGS database has been used only as a data repository.
Bioinformaticians use the data to test their own
computer analysis programmes, whereas general biologists
lacking computational skills rarely use the huge
and unwieldy datasets. In this report, we present an
example of how biologists may use public NGS data to
their advantage. Genome sequences used as references
for re-sequencing are often updated, resulting in the 
shifting of mapped positions of reads. Researchers 
may wish to compare SNPs that have been newly 
mapped and detected by them to previously detected 
SNPs in the public database. SNP positions in databases,
such as dbSNP, 44 based on older reference genomes will
lead to confusion. The DDBJ Pipeline not only provides a 
computational environment for analysing NGS data, 
but also permits seamless access to the public domain 
data, including NGS short reads and complete 
genomes. Researchers familiar with the DDBJ Pipeline 
will be able to re-perform reference mapping quickly 
using public NGS reads with current genome 
sequences. Comparing or merging SNP data from their
own dataset with the public data using the DDBJ
Pipeline may also allow re-analysis with preferred
parameters. We expect the DDBJ Pipeline to
analyse users' NGS data, thereby accelerating the sub- 
mission of NGS data to public databases such as DDBJ. 

3.6. The cloud computational system for NGS 
data analysis 

A recent cloud computing innovation is virtual 
machines (VMs), which are programmes that perform 
parallel processing to overcome differences between 
server platforms. VM technology has been used in
bioinformatics. 11-13 CloVR 45 has been developed for
analysing bacterial NGS data, and the RseqFlow workflow 46
processes RNA-Seq data. Cloud computing with VM
expands genome informatics, and more tools will
appear in future. Although the DDBJ Pipeline accommodates
individual users' data, some biologists wish to host
NGS analysis packages on their local servers to keep
their data private. Therefore, we are studying the 
future incorporation of a VM package into the DDBJ 
Pipeline. 